# Motivation

### What is your dataset?

The dataset used in this notebook is the camnugent UFO sightings dataset, downloaded from Kaggle: https://www.kaggle.com/datasets/camnugent/ufo-sightings-around-the-world?resource=download

The dataset contains over 80,000 UFO sightings, with up to 11 variables recorded per sighting. These include the time of observation, a description of the aircraft, the duration of the encounter and the location of the sighting.

### Why did you choose this/these particular dataset(s)?

Only a handful of other datasets concerning UFO sightings exist. All the datasets we reviewed were scraped from the same source (https://github.com/planetsig/ufo-reports). This particular dataset is the oldest of the bunch and has the cleanest data, hence the final decision to use it. The reason for working with UFO sightings is partly interest in the subject but, more importantly, how well the data fits the course. The data is suitable for many kinds of visualization, including map visualizations and interactive exploration, and it is periodic.

### What was your goal for the end user's experience?

We want the user to discover possible explanations for why UFOs are sighted more than ever in today's age, and for why the sightings are heavily location-bound. Once the end user has been presented with different viewpoints, the intention is for the user to explore the data independently and form their own opinions on the matter.

# Basic stats. Let's understand the dataset better

### Write about your choices in data cleaning and preprocessing

Not much data cleaning has been done on this dataset. As mentioned in the previous section, this particular dataset is the cleanest of those available; the other datasets had stray spaces and inconsistent datetime formats. There are, however, NaN values in several of the columns. We decided to keep the rows containing these NaN values for any analysis that did not directly use the affected columns. When an affected column is used in a calculation, the NaNs are excluded from that calculation.

Some columns had mixed data types by default. We decided to read these columns as strings initially. This made it simpler to treat all rows equally, which is particularly useful for row operations that require a single data type.

### Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

In this dataset, there are over 80,000 observations scattered across the globe. The first task in the investigation was to gain general knowledge of global trends and then narrow down the scope. First, we found that around 92% of the observations happened after the year 1990. Furthermore, roughly 75% of observations were in the USA after the year 1990. We therefore decided to focus on the USA after 1990, as this is where the vast majority of the observations lie.

The final narrowing of the data happened when we discovered that California had far more observations than any other state. To put this into perspective: Washington, the state with the second-most sightings, has 3777 in the period 1990 to 2014, while California, at the top of the list, has 8301.

# Data Analysis

### Describe your data analysis and explain what you've learned about the dataset.

The investigation undertaken in this analysis is especially tricky in terms of findings, which we knew going into the project. On the one hand, we can view all of these UFO sightings as visits from aliens. On the other hand, every single observation in this dataset may be a human error; after all, it is widely acknowledged that UFOs have never visited Earth. If the latter is to be believed, then a great deal of randomness and unpredictability has been introduced into the data.

The analysis has been very exploratory and curious due to the nature of the data. It often raises more questions than it answers, as it tries to rationalize UFO sightings using other factors such as drug use and time of year.

The analysis gave a great insight

# Genre

* Which genre of data story did you use?
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

# Visualizations

* Explain the visualizations you've chosen.
* Why are they right for the story you want to tell?

# Discussion

* Think critically about your creation
* What went well?
* What is still missing? What could be improved? Why?

# Contributions

* Who did what?
* You should write (just briefly) which group member was the main responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work. That's what you should explain).
* It is not OK simply to write "All group members contributed equally".
* Make sure that you use references when they're needed and follow academic standards.
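The cleaning choices described under "Basic stats" can be illustrated with a minimal sketch. The three-row inline CSV and its values are invented for illustration; only the column names `Date_time`, `length_of_encounter_seconds` and `UFO_shape` come from the notebook. The idea is to read a mixed-type column as strings, keep rows with missing values in the frame, and convert to numbers only at the point of calculation, where pandas skips NaN by default.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
raw = io.StringIO(
    "Date_time,length_of_encounter_seconds,UFO_shape\n"
    "10/10/1995 20:30,120,disk\n"
    "11/02/2003 21:00,,light\n"
    "07/04/2010 22:15,30x,triangle\n"
)

# Read the mixed-type column as strings so every row is treated the same way.
sample = pd.read_csv(raw, dtype={"length_of_encounter_seconds": str})

# Rows with missing values are kept in the frame ...
rows_kept = len(sample)

# ... and only excluded when the column is actually used in a calculation.
# errors="coerce" turns unparseable entries such as "30x" into NaN.
seconds = pd.to_numeric(sample["length_of_encounter_seconds"], errors="coerce")
mean_seconds = seconds.mean()  # NaN values are skipped by default

print(rows_kept, mean_seconds)
```

Converting late keeps the raw text available for inspection, at the cost of repeating the conversion wherever the column is used.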
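The narrowing steps from the dataset-stats discussion (the share of sightings from 1990 onward, the share that is also in the USA, and then counts per state) reduce to boolean masks and `value_counts`. The following is a minimal sketch on invented rows; the column names `Year`, `country` and `state/province` match the notebook, everything else is hypothetical, and reading "after 1990" as `Year >= 1990` is an assumption.

```python
import pandas as pd

# Invented rows for illustration; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Year": [1974, 1995, 2001, 2008, 2012],
        "country": ["us", "us", "gb", "us", "us"],
        "state/province": ["tx", "ca", None, "ca", "wa"],
    }
)

# Assumption: "after 1990" is read as Year >= 1990.
recent = toy["Year"] >= 1990
recent_us = recent & (toy["country"] == "us")

# The mean of a boolean mask is the fraction of rows where it is True.
share_recent = recent.mean()
share_recent_us = recent_us.mean()

# Sightings per state within the narrowed subset, largest first.
state_counts = toy.loc[recent_us, "state/province"].str.upper().value_counts()

print(f"{share_recent:.0%} from 1990 onward, {share_recent_us:.0%} also in the US")
print(state_counts)
```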
# Load the dataset; columns with mixed types are read as strings
import pandas as pd
df = pd.read_csv(
    "ufo_sighting_data.csv",
    dtype={'Date_time': str, 'length_of_encounter_seconds': str,
           'latitude': str, 'state/province': str},
)
df

# Drop duplicate rows; reset the index so row labels stay contiguous
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))

# Extract the year from "month/day/year hour:minute"
import numpy as np
df["YD"] = df["Date_time"].str.split('/').str[2]
df["Year"] = df["YD"].str.split(' ').str[0].astype(int)

# Note: despite its name, df_1990 keeps every sighting from 1950 onward
df_1990 = df[df["Year"] >= 1950]
len(df)

np.histogram(df["Year"])
min(df["Year"])

# Sightings per year (one bin per year from 1950)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
fig = plt.figure(figsize=(10, 7))
hist, bins = np.histogram(df_1990["Year"], bins=64)
plt.title("UFO sightings per year")
plt.xlabel("Year")
plt.ylabel("UFO sightings")
plt.bar(bins[:-1], hist, color='green')
plt.show()

# Restrict to sightings in the USA
df_us = df[df["country"] == "us"]

df_us = df_us.reset_index(drop=True)
df_us

# Number of sightings per state
df_stategrp = pd.DataFrame(
    df_us.groupby(['state/province'], as_index=False)['city'].count().sort_values("city")
)
df_stategrp['state/province'] = df_stategrp['state/province'].str.upper()

# Choropleth of raw sighting counts per state
import plotly.graph_objects as go
fig = go.Figure(data=go.Choropleth(
    locations=df_stategrp['state/province'],
    z=df_stategrp['city'].astype(float),
    locationmode='USA-states',
    colorscale='Greens',
    colorbar_title="Sightings",
))
fig.update_layout(
    title_text='UFO sightings by state',
    geo_scope='usa',
)
fig.show()

# Bar chart of sightings per state, most sightings first
# Note: df_us holds all years at this point; filter on Year first if the
# 1990 to 2013 range in the title is intended
import seaborn as sns
df_us['state/province'] = df_us['state/province'].str.upper()
plt.figure(figsize=(20, 7))
top_state = df_us['state/province'].value_counts().index
sns.countplot(
    data=df_us, x='state/province', order=top_state,
    palette=sns.cubehelix_palette(start=2, rot=0, dark=.3, light=.8, reverse=True, n_colors=52),
)
plt.title("UFO sightings in the USA from 1990 to 2013 by state")
plt.xlabel("States")
plt.ylabel("Total sightings")

# Population per state; strip thousands separators and convert to integers.
# The conversion is done on the whole column because values assigned to the
# rows yielded by iterrows() are not reliably written back to the frame.
pop_edu = pd.read_csv("pop_edu.txt", sep="\t")
pop_col = 'Population over the age of 25'
pop_edu[pop_col] = pop_edu[pop_col].astype(str).str.replace(",", "", regex=False).astype(int)
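Population figures stored as text with thousands separators need converting before they can feed a per-capita rate. The sketch below shows one way to do the conversion and the state lookup without explicit loops. The figures are invented, and the column names mirror those used in the notebook (`State`, `Population over the age of 25`, `state/province`, `city`); `per_million` is a hypothetical column name. One reason to prefer a column-wise conversion is that the rows yielded by `iterrows()` are generally copies, so writing to them is not guaranteed to update the frame.

```python
import pandas as pd

# Invented figures for illustration only.
pop = pd.DataFrame(
    {
        "State": ["CA", "WA", "ZZ"],  # "ZZ" has no sightings on purpose
        "Population over the age of 25": ["2,000,000", "500,000", "100,000"],
    }
)
sightings = pd.DataFrame({"state/province": ["WA", "CA"], "city": [50, 100]})

# Strip the thousands separators and convert the whole column at once.
col = "Population over the age of 25"
pop[col] = pop[col].str.replace(",", "", regex=False).astype(int)

# Look up each state's sighting count by label; unmatched states become NaN
# instead of silently shifting the remaining values.
counts = sightings.set_index("state/province")["city"]
pop["per_million"] = pop["State"].map(counts) / pop[col] * 1_000_000

print(pop)
```

Matching by label rather than by position means a state missing from either table shows up as a visible NaN.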
# Sightings per million people (aged over 25) for each state.
# States are matched by label, so a state without sightings becomes NaN
# rather than causing a length mismatch.
state_counts = df_stategrp.set_index('state/province')['city']
pop_edu['Ufos seen per million people'] = (
    pop_edu['State'].map(state_counts)
    / pop_edu['Population over the age of 25'].astype(float)
    * 1000000
).round()

pop_edu

# Choropleth of sightings per million people
fig = go.Figure(data=go.Choropleth(
    locations=pop_edu['State'],
    z=pop_edu['Ufos seen per million people'].astype(float),
    locationmode='USA-states',
    colorscale='Greens',
    colorbar_title="Sightings per million people",
))
fig.update_layout(
    title_text='UFO sightings per million people by state',
    geo_scope='usa',
)
fig.show()

# Restrict to California
df_ca = df_us[df_us["state/province"] == "CA"]

# Heat map of Californian sightings
import folium
from folium import plugins
from folium.plugins import HeatMap
heat_map = folium.Map(location=[36.778259, -119.417931], zoom_start=6)
# latitude was read as a string, so convert the coordinates to numbers and
# drop rows where either coordinate is missing or unparseable
coords = df_ca[['latitude', 'longitude']].apply(pd.to_numeric, errors='coerce').dropna()
heat_data = coords.values.tolist()
HeatMap(
    heat_data,
    gradient={0.1: 'blue', 0.3: 'lime', 0.5: 'yellow', 0.7: 'orange', 1: 'red'},
    min_opacity=0.05, max_opacity=0.9, radius=25, blur=15, use_local_extrema=False,
).add_to(heat_map)
heat_map

# Calendar date of each US sighting (month/day/year)
df_us['Date'] = pd.to_datetime(df_us["Date_time"].str.split(' ').str[0], format='%m/%d/%Y')
df_us['Date']
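The `Date_time` strings appear to follow a month/day/year hour:minute layout (the notebook reads the month from the first field and the year from the third), and some rows use hour 24, which standard parsers reject. The following is a minimal sketch of one way to parse such strings with the standard library. `parse_sighting_time` is a hypothetical helper, and rolling 24:00 over to midnight of the next day is an assumption about the intended meaning; the notebook itself rewrites hour 24 to 00 on the same date.

```python
from datetime import datetime, timedelta


def parse_sighting_time(text: str) -> datetime:
    """Parse 'MM/DD/YYYY HH:MM', treating hour 24 as midnight of the next day."""
    date_part, time_part = text.split(" ")
    hour, minute = time_part.split(":")
    roll_over = hour == "24"
    if roll_over:
        hour = "00"
    parsed = datetime.strptime(f"{date_part} {hour}:{minute}", "%m/%d/%Y %H:%M")
    return parsed + timedelta(days=1) if roll_over else parsed


# Made-up example strings in the same layout as the dataset.
regular = parse_sighting_time("10/10/1995 20:30")
midnight = parse_sighting_time("12/31/1999 24:00")

print(regular.date(), regular.strftime("%A"))
print(midnight)
```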
# Daily sighting counts per state; keep the Californian series
date_count = pd.DataFrame(
    df_us.groupby(['Date', 'state/province'])['state/province'].size().unstack(level=1)
)
events = pd.Series(date_count['CA'])
events = events.fillna(0)
events

# Calendar date (as text) of each Californian sighting
df_ca = df_ca.reset_index(drop=True)
df_ca['Date'] = df_ca["Date_time"].str.split(' ').str[0]
df_ca['Date']

# Number of sightings per date and shape
shape_date = pd.DataFrame(
    df_ca.groupby(['Date', 'UFO_shape'])['UFO_shape'].count().unstack(level=1)
)
shape_date = shape_date.fillna(0)
shape_date

# Dates on which one shape was reported more than five times.
# If several shapes qualify on the same date, the last one is kept.
shape_dict = {}
for index, row in shape_date.iterrows():
    for shape, val in row.items():
        if val > 5.0:
            shape_dict[str(index)] = [shape, val]
shape_dict

# Marker colours, one per multi-sighting event
colors = [
    'darkred',
    'blue',
    'gray',
    'orange',
    'beige',
    'green',
    'purple',
    'cadetblue',
    'pink',
]
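The `shape_dict` step singles out days on which one shape was reported more than five times. The same idea can be expressed as counting (date, shape) pairs and keeping those above a threshold. This is a minimal sketch with invented records; `find_clusters` and its `min_reports` parameter are hypothetical names, and unlike a dictionary keyed by date alone it keeps every qualifying shape on a given day.

```python
from collections import Counter


def find_clusters(records, min_reports=6):
    """Return {(date, shape): count} for pairs reported at least min_reports times."""
    pair_counts = Counter((date, shape) for date, shape in records)
    return {pair: n for pair, n in pair_counts.items() if n >= min_reports}


# Invented (date, shape) records for illustration.
records = (
    [("7/4/2010", "fireball")] * 7
    + [("7/4/2010", "light")] * 6
    + [("7/5/2010", "disk")] * 2
)

clusters = find_clusters(records)
for (date, shape), n in sorted(clusters.items()):
    print(date, shape, n)
```

The default of six reports corresponds to the "more than five" condition used in the notebook.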
# Map of the multi-sighting events, one colour per event
icon_path = r"C:\Users\Mikkel\Desktop\2. semester\Tirsdag - 02806 Social data analysis and visualization\Untitled Folder\ufo.png"

def make_popup(row):
    # Build the popup shown when a marker is clicked
    html = """\
    <html>
    <body>
    <p>
    <b>Time:</b>{time}
    <br><b>Shape:</b>{shape}<br>
    <br>{desc}<br>
    </p>
    </body>
    </html>
    """.format(time=row["Date_time"], shape=row["UFO_shape"], desc=row['description'])
    iframe = folium.IFrame(html, width=180, height=200)
    return folium.Popup(iframe, max_width=300, max_height=500)

multiple_sightings_map = folium.Map(location=[36.778259, -119.417931], zoom_start=6)
for count, key in enumerate(shape_dict):
    # Cycle through the colours so more events than colours cannot raise an IndexError
    event_color = colors[count % len(colors)]
    for index, row in df_ca.iterrows():
        if row["Date_time"].split(' ')[0] == key.split(' ')[0] and row["UFO_shape"] == shape_dict[key][0]:
            ufo_icon = folium.features.CustomIcon(icon_path, icon_size=(30, 30))
            marker = folium.Marker(
                location=[row['latitude'], row['longitude']],
                icon=ufo_icon,
                color=event_color,
                popup=make_popup(row),
            ).add_to(multiple_sightings_map)
            folium.Circle(
                location=[row['latitude'], row['longitude']],
                fill_color=event_color, radius=100, weight=10, color=event_color,
            ).add_to(multiple_sightings_map)
multiple_sightings_map

# Map of every Californian sighting, all in green.
# Each sighting is added once; looping over shape_dict as well would add
# the same markers once per event.
multiple_sightings_map = folium.Map(location=[36.778259, -119.417931], zoom_start=6)
for index, row in df_ca.iterrows():
    ufo_icon = folium.features.CustomIcon(icon_path, icon_size=(30, 30))
    marker = folium.Marker(
        location=[row['latitude'], row['longitude']],
        icon=ufo_icon,
        color='green',
        popup=make_popup(row),
    ).add_to(multiple_sightings_map)
    folium.Circle(
        location=[row['latitude'], row['longitude']],
        fill_color='green', radius=100, weight=10, color='green',
    ).add_to(multiple_sightings_map)
multiple_sightings_map

# Lightweight point map of Californian sightings
CA_map = folium.Map(location=[36.778259, -119.417931], zoom_start=6)
for index, row in df_ca.iterrows():
    folium.CircleMarker(
        [row['latitude'], row['longitude']],
        radius=1,
        popup="{} | {} | {}".format(row['Date_time'], row['UFO_shape'], row['described_duration_of_encounter']),
        color='green',
    ).add_to(CA_map)
CA_map

# Sightings per year and state
sighting_date = pd.DataFrame(
    df_us.groupby(['Year', 'state/province'])['state/province'].count().unstack(level=1)
)
sighting_date = sighting_date.fillna(0)
sighting_date

# Year labels as strings; sorted so they line up with the grouped index,
# which groupby returns in ascending order
years = sorted(df_us['Year'].unique())
years = list(map(str, years))

sighting_date.index = years
sighting_date.index.name = 'years'

sighting_date.index

# Resident population per state and census year
df_pop = pd.read_csv("apportionment.csv")
df_pop

df_pop = df_pop[df_pop['Geography Type'] == 'State']
df_pop_rel = df_pop[['Name', 'Year', 'Resident Population']].copy()

years_pop = list(df_pop['Year'].unique())
years_pop = list(map(str, years_pop))

# One column per state, one row per census year.
# Caution: the relabelling below assumes that the states in the population
# table come out in the same order, and in the same number, as the state
# abbreviations in sighting_date.columns.
df_pop_rel_T = (
    df_pop_rel.groupby('Name')['Resident Population']
    .apply(lambda s: s.reset_index(drop=True))
    .unstack()
    .reset_index()
)
df_pop_rel_T['Name'] = sighting_date.columns
df_pop_rel_T = df_pop_rel_T.set_index('Name')
df_pop_rel_T.columns = years_pop
df_pop_rel_T = df_pop_rel_T.transpose()
df_pop_rel_T

# Interactive bar and line chart of sightings per year for each state
from bokeh.palettes import Dark2_5
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure, show
import itertools

source = ColumnDataSource(sighting_date)
p = figure(title="Sightings by state",
           x_range=FactorRange(*years),
           x_axis_label='Years',
           y_axis_label='Sightings',
           width=1600, height=1300)
state_colors = itertools.cycle(Dark2_5)
for state, color in zip(list(sighting_date.columns), state_colors):
    p.vbar(x='years', top=state, width=0.9, source=source,
           line_color='black', color=color, muted_alpha=0.2,
           legend_label=state, muted=True)
    # The line uses the same year labels as the bars so that x and y have equal length
    p.line(x=years, y=sighting_date[state], legend_label=state, line_width=2, color=color)
# Move the automatically generated legend outside the plot area
p.add_layout(p.legend[0], 'right')
p.legend.click_policy = "hide"
show(p)

from bokeh.plotting import figure, output_file, save
# set output to static HTML file
output_file(filename="Interactive.html", title="Static HTML file")
save(p)

df

# Sightings per year and country
sighting_world = pd.DataFrame(df.groupby(['Year', 'country'])['country'].count().unstack(level=1))
sighting_world = sighting_world.fillna(0)
sighting_world

df_rest = df[df['country'] != 'us']

# Sightings per year at four geographic scopes
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
fig = plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
hist, bins = np.histogram(df["Year"], bins=50)
plt.title("UFO sightings per year in all represented countries")
plt.xlabel("Years")
plt.ylabel("UFO sightings")
plt.bar(bins[:-1], hist)
plt.subplot(2, 2, 2)
hist, bins = np.histogram(df_us["Year"], bins=50)
plt.title("UFO sightings per year in the United States of America")
plt.xlabel("Years")
plt.ylabel("UFO sightings")
plt.bar(bins[:-1], hist)
plt.subplot(2, 2, 3)
hist, bins = np.histogram(df_ca["Year"], bins=50)
plt.title("UFO sightings per year in California")
plt.xlabel("Years")
plt.ylabel("UFO sightings")
plt.bar(bins[:-1], hist)
plt.subplot(2, 2, 4)
hist, bins = np.histogram(df_rest["Year"], bins=50)
plt.title("UFO sightings per year excluding America")
plt.xlabel("Years")
plt.ylabel("UFO sightings")
plt.bar(bins[:-1], hist)
plt.show()

# Some timestamps use hour 24; rewrite them as hour 00 on the same date
df['Date_time'] = df['Date_time'].str.replace(r' 24:(\d{2})', r' 00:\1', regex=True)

# Day of the week of each sighting (assumes month/day/year hour:minute)
weekday_dict = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df['Day_of_week'] = pd.to_datetime(df['Date_time'], format='%m/%d/%Y %H:%M').dt.weekday.map(weekday_dict)

weekday_grp = pd.DataFrame(
    df.groupby(['Day_of_week'], sort=False)['Day_of_week'].count().reindex(weekday_dict.values())
)
weekday_grp

sns.set()
fig = plt.figure(figsize=(15, 10))
plt.bar(weekday_grp.index, weekday_grp['Day_of_week'], color='green', width=0.4)
plt.title("UFO sightings per weekday")
plt.xlabel("Weekdays")
plt.ylabel("UFO sightings")
plt.show()

# Sightings per month
df['Month'] = df['Date_time'].str.split('/').str[0].astype(int)
month_grp = pd.DataFrame(df.groupby(['Month'])['Month'].count())
month_grp = month_grp.sort_index(ascending=True)
month_grp.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

sns.set()
fig = plt.figure(figsize=(15, 10))
plt.bar(month_grp.index, month_grp['Month'], color='green', width=0.4)
plt.title("UFO sightings per Month")
plt.xlabel("Months")
plt.ylabel("UFO sightings")
plt.show()

# Sightings per hour of the day; the hour is stored as an integer so that
# the bars sort numerically rather than alphabetically
df['Time_hour'] = df['Date_time'].str.split(' ').str[1].str.split(':').str[0].astype(int)
hour_grp = pd.DataFrame(df.groupby(['Time_hour'], sort=False)['Time_hour'].count())
hour_grp = hour_grp.sort_index(ascending=True)

sns.set()
fig = plt.figure(figsize=(15, 10))
plt.bar(hour_grp.index, hour_grp['Time_hour'], color='green', width=0.4)
plt.title("UFO sightings per 24-hour cycle")
plt.xlabel("Time in hours in the 24 hour cycle")
plt.ylabel("UFO sightings")
plt.show()

# Sightings per day of the month, then all four periodic views in one figure
df['Day'] = df['Date_time'].str.split('/').str[1].astype(int)
day_grp = pd.DataFrame(df.groupby(['Day'], sort=False)['Day'].count())
day_grp = day_grp.sort_index(ascending=True)
sns.set()
fig = plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
plt.bar(month_grp.index, month_grp['Month'], color='green', width=0.4)
plt.title("UFO sightings per Month")
plt.xlabel("Months")
plt.ylabel("UFO sightings")
plt.subplot(2, 2, 2)
plt.bar(weekday_grp.index, weekday_grp['Day_of_week'], color='green', width=0.4)
plt.title("UFO sightings per weekday")
plt.xlabel("Weekdays")
plt.subplot(2, 2, 3)
plt.bar(day_grp.index, day_grp['Day'], color='green', width=0.4)
plt.title("UFO sightings per date of the month")
plt.xlabel("Date of the month")
plt.ylabel("UFO sightings")
plt.subplot(2, 2, 4)
plt.bar(hour_grp.index, hour_grp['Time_hour'], color='green', width=0.4)
plt.title("UFO sightings per 24-hour cycle")
plt.xlabel("Time in hours in the 24 hour cycle")
plt.tight_layout()
plt.show()
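For the hour-of-day view, two details can distort the chart: hour labels kept as text sort alphabetically if they are not zero-padded, and hours with no reports drop out of the axis entirely. The following is a minimal sketch of tallying hours as integers over a fixed range of 0 to 23, using invented timestamps. `hourly_counts` is a hypothetical helper, and mapping hour 24 to 0 mirrors the rewrite applied to `Date_time` in the notebook.

```python
from collections import Counter


def hourly_counts(timestamps):
    """Count sightings per hour of day, returning a list indexed by hour 0-23."""
    hours = Counter(int(ts.split(" ")[1].split(":")[0]) % 24 for ts in timestamps)
    return [hours.get(h, 0) for h in range(24)]


# Invented timestamps in the 'MM/DD/YYYY HH:MM' layout.
stamps = ["10/10/1995 20:30", "1/5/2004 2:00", "6/1/2008 20:05", "3/3/2011 24:00"]
counts = hourly_counts(stamps)

print(counts)
```

Because the result always has 24 entries, it can be passed straight to a bar chart with `range(24)` on the x-axis.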